Building fully reproducible data science environments for R and python with ease using nix, rix, and Docker
Thing I want to talk about
What I mean by reproducibility
What is Nix, how it works and its complementary relationship to Docker
What I will not discuss (but is very useful!):
- FP, Git, Documenting, testing and packaging code, build automation with
{targets}
What I mean by reproducibility
- Ability to recover exactly the same results from an analysis
- Why would you want that?
- Auditing purposes
- Update of data (only impact must be from data update)
- Reproducibility as a cornerstone of science
- (Work on an immutable dev environment)
Turning our scripts reproducible
We need to answer these questions
- How easy would it be for someone else to rerun the analysis?
- How easy would it be to update the project?
- How easy would it be to reuse this code for another project?
- What guarantee do we have that the output is stable through time?
Reproducibility is on a continuum (1/2)
Here are the 4 main things influencing an analysis’ reproducibility:
- Version of R used
- Versions of packages used
- Operating system
- Hardware
Reproducibility is on a continuum (2/2)
![]()
Source: Peng, Roger D. 2011. “Reproducible Research in Computational Science.” Science 334 (6060): 1226–27
Package versioning with renv
renv is commonly used
- Run
renv::init() to generate library snapshot as a renv.lock file
What an renv.lock file looks like
{
"R": {
"Version": "4.2.2",
"Repositories": [
{
"Name": "CRAN",
"URL": "https://packagemanager.rstudio.com/all/latest"
}
]
},
"Packages": {
"MASS": {
"Package": "MASS",
"Version": "7.3-58.1",
"Source": "Repository",
"Repository": "CRAN",
"Hash": "762e1804143a332333c054759f89a706",
"Requirements": []
},
"Matrix": {
"Package": "Matrix",
"Version": "1.5-1",
"Source": "Repository",
"Repository": "CRAN",
"Hash": "539dc0c0c05636812f1080f473d2c177",
"Requirements": [
"lattice"
]
***and many more packages***
Restoring a library using an renv.lock file
renv.lock file not just a record
- Can be used to restore as well!
- Run
renv::restore()
{renv} alone by itself is not enough
Shortcomings:
- Records, but does not restore the version of R
- Installation of old packages can fail (due to missing OS-dependencies)
- Generating a
renv.lock file is “free”
- Provides a blueprint for dockerizing our pipeline
- Creates a project-specific library (no interferences)
Going further with Docker: handling R and system-level dependencies
- Docker is a containerisation tool
- Docker allows you to build images and run containers (a container is an instance of an image)
- Docker images:
- contain all the software and code needed for your project
- are immutable (cannot be changed at run-time)
- can be shared on- and offline
Without Docker
With Docker
Dockerizing a project (1/3)
Dockerizing a project could look like this:
- At image build-time:
- install R (or use an image that ships R)
- install packages (using our
renv.lock file)
- copy all scripts to the image
- run the analysis using
targets::tar_make()
- At container run-time:
- copy the outputs of the analysis from the container to your computer
- possible to “log-in” into a running container to inspect code and outputs
Dockerizing a project (2/3)
- Restoring package with
{renv} can be tricky:
#> * installing *source* package ‘ModelMetrics’ ...
#> ** package ‘ModelMetrics’ successfully unpacked and MD5 sums checked
#> ** using staged installation
#> ** libs
#> /usr/bin/clang++ -std=gnu++11 -I"/opt/R-devel/lib64/R/include" -DNDEBUG -I'/home/docker/R/Rcpp/include' -I/usr/local/include -fpic -g -O2 -c RcppExports.cpp -o RcppExports.o
#> /usr/bin/clang++ -std=gnu++11 -I"/opt/R-devel/lib64/R/include" -DNDEBUG -I'/home/docker/R/Rcpp/include' -I/usr/local/include -fpic -g -O2 -c auc_.cpp -o auc_.o
#> auc_.cpp:2:10: fatal error: 'omp.h' file not found
#> #include
#> ^~~~~~~
#> 1 error generated.
#> make: *** [/opt/R-devel/lib64/R/etc/Makeconf:178: auc_.o] Error 1
#> ERROR: compilation failed for package ‘ModelMetrics’
Dockerizing a project (2/3)
- The older the
renv.lock file, the harder to restore!
- Gets very complicated if you add Python and/or other tools.
Dockerizing a project (3/3)
- ALSO! Image build process not reproducible per se, only running containers is
- YOU need to make sure build process is reproducible (or store the built images)
- Need to fix version of R
- Base image layer becomes unsupported at some point
Docker conclusion
- Docker is very useful and widely used
- But the entry cost is high (familiarity with Linux is recommended)
- Single point of failure (what happens if Docker gets bought, abandoned, etc? quite unlikely though)
- Not actually dealing with reproducibility per se, we’re “abusing” Docker in a way
The Nix package manager (1/2)
Package manager: tool to install and manage packages
Package: any piece of software (not just R packages)
A popular package manager:
The Nix package manager (2/2)
- Reproducibility: R, R packages and other dependencies must be managed
- Nix deals with everything, with one single text file (called a Nix expression)!
- Nix is a package manager actually focused on reproducible builds
- These Nix expressions always build the exact same output
A basic Nix expression (1/6)
let
pkgs = import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/976fa3369d722e76f37c77493d99829540d43845.tar.gz") {};
system_packages = builtins.attrValues {
inherit (pkgs) R ;
};
in
pkgs.mkShell {
buildInputs = [ system_packages ];
shellHook = "R --vanilla";
}
There’s a lot to discuss here!
A basic Nix expression (2/6)
- Written in the Nix language (not discussed)
- Defines the repository to use (with a fixed revision)
- Lists packages to install
- Defines the output: a development shell
A basic Nix expression (3/6)
- Software for Nix is defined as a mono-repository of tens of thousands of expressions on Github
- Github: we can use any commit to pin package versions for reproducibility!
- For example, the following commit installs R 4.3.1 and associated packages:
pkgs = import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/976fa3369d722e76f37c77493d99829540d43845.tar.gz") {};
A basic Nix expression (4/6)
system_packages: a variable that lists software to install
- In this case, only R:
system_packages = builtins.attrValues {
inherit (pkgs) R ;
};
A basic Nix expression (5/6)
- Finally, we define a shell:
pkgs.mkShell {
buildInputs = [ system_packages ];
shellHook = "R --vanilla";
}
- This shell will come with the software defined in
system_packages (buildInputs)
- And launch
R --vanilla when started (shellHook)
A basic Nix expression (6/6)
- Writing these expressions requires learning a new language
- While incredibly powerful, if all we want are per-project reproducible dev shells…
- …then
{rix} will help!
Nix expressions
- Nix expressions can be used to install software
- But we will use them to build per-project development shells
- We will include R, LaTeX packages, or Quarto, Python, Julia….
- Nix takes care of installing every dependency down to the compiler!
rix: reproducible development environments with Nix (1/5)
{rix} (website) makes writing Nix expression easy!
- Simply use the provided
rix() function:
library(rix)
rix(
date = "2025-02-17",
r_pkgs = "ggplot2",
py_conf = list(
py_version = "3.12",
py_pkgs = c("polars", "great-tables")
),
overwrite = TRUE
)
rix: reproducible development environments with Nix (2/5)
renv.lock files can also be used as starting points:
library(rix)
renv2nix(
renv_lock_path = "path/to/original/renv_project/renv.lock",
project_path = "path/to/rix_project",
override_r_ver = "4.4.1" # <- optional
)
rix: reproducible development environments with Nix (3/5)
- List required R version and packages
- Optionally: more system packages, packages hosted on Github, or LaTeX packages
- Optionally: an IDE (Rstudio, Radian, VS Code or “other”)
- Work interactively in an isolated, project-specific and reproducible environment!
rix: reproducible development environments with Nix (4/5)
- Time for a demonstration, see
scripts/nix_expressions/docker/
- First outside of Docker
- Then we dockerize
rix: reproducible development environments with Nix (5/5)
- Can install specific versions of packages (write
"dplyr@1.0.0")
- Can install packages hosted on Github
- Many vignettes to get you started! See here
Let’s check out scripts/nix_expressions/rix_intro/
Non-interactive use
{rix} makes it easy to run pipelines in the right environment
- (Little side note: the best tool to build pipelines in R is
{targets})
- See
scripts/nix_expressions/nix_targets_pipeline
- Can also run the pipeline like so:
cd /absolute/path/to/pipeline/ && nix-shell default.nix --run "Rscript -e 'targets::tar_make()'"
Nix and Github Actions: running pipelines
- Possible to easily run a
{targets} pipeline on Github actions
- Simply run
rix::tar_nix_ga() to generate the required files
- Commit and push, and watch the actions run!
- See here.
Nix and Github Actions: writing papers
- Easy collaboration on papers as well
- See here
- Just focus on writing!
Conclusion
- Very vast and complex topic!
- At the very least, generate an
renv.lock file
- Always possible to rebuild a Docker image in the future (either you, or someone else!)
- Consider using
{targets}: not only good for reproducibility, but also an amazing tool all around
- Long-term reproducibility: must use Docker or Nix (better: both!) and maintenance effort is required as well
The end
Contact me if you have questions:
- bruno@brodrigues.co
- Twitter: @brodriguesco
- Mastodon: @brodriguesco@fosstodon.org
- Blog: www.brodrigues.co
- Book: www.raps-with-r.dev
- rix: https://docs.ropensci.org/rix